Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Replace vignettes and examples #11123

Merged
merged 22 commits into from
Jan 15, 2025
Merged

Conversation

david-cortes
Copy link
Contributor

ref #9810
closes #10746

This PR adds a new introductory vignette which replaces most of the previous ones, and modifies the code examples throughout functions aimed at interactive usage to call xgboost() instead of xgb.train().

Motivation

Since the time that XGBoost was first published at CRAN, its adoption and mindshare have risen substantially, to the point that it has become the standard when it comes to boosted decision trees. In this day and age, I don't think the package needs to provide any introduction to the concepts of gradient-boosting, cross-validation, evaluation metrics, and so on - people who use R are already going to be familiar with those, and the things it compares against (like the package 'gbm') have become obsolete by now.

As well, the documentation and tutorials for XGBoost have mostly moved to the online docs - any R-specific documents become outdated rather soon, and are less likely to be seen by a random user. Most of the python examples and guides should in any event work with the R interface with very minimal modifications like dict->list.

Apart from becoming a standard-use library, the features supported by XGBoost have expanded over time, and lots of the materials that were there before, such as the first vignette, contained tips that are not applicable to the current state of the library, like manually one-hot encoding categorical features.

Hence, I decided to remove the previous vignettes and create a new one from scratch, which contains only examples around the usage of the R interface and its conventions.

Help needed

It would be ideal if this vignette could also get added to the online docs.

Thus, I created the vignette as a quarto file (.qmd), which has the option to render to both .html (what CRAN hosts) and .md (which can be included in the .rst files).

Only, getting it to render to .md required building the vignette with jupyter instead of knitr, which in turn requires installs of python, jupyter, ipykernel, and the "ir" kernel that runs R in jupyter, plus registering that kernel in the user-level config for jupyter. By adding that line "jupyter: ir", it additionally makes the default quarto render (e.g. as used by the "knit" button in RStudio) build the .html vignette using jupyter instead of knitr, which is most definitely not going to work out in CRAN servers. I don't know how to solve this.

Would also be nice if some CI job could be auto-building the .md file for the online docs from the .qmd source of the vignette.

(CCing @mayer79 and @jameslamb )

@jameslamb
Copy link
Contributor

jameslamb commented Dec 28, 2024

Thanks for the @. I haven't reviewed the content of this PR closely so am limiting my comment only to the topic of the "Help Needed" part above.

I know nothing about quarto or how to integrate it with CRAN's vignette-building and vignette-checking. Maybe these examples of other packages using it successfully on CRAN could lead to some good reference implementations: https://github.com/search?q=org%3Acran%20path%3A*.qmd&type=code

I can at least tell you how we're addressing these things LightGBM, maybe it will be helpful.

In {lightgbm}, we have a lightweight system that has been working well for us (which you alluded to in #9906 (comment)). Here's how we do this there:

  1. vignettes are written in .Rmd files with headmatter like the following, using {markdown} + {knitr}
output:
  markdown::html_format:
    options:
      toc: true
      number_sections: true
vignette: >
  %\VignetteIndexEntry{Basic Walkthrough}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}

(code link)

  1. The main docs site is built via Sphinx, and the Sphinx build builds a {pkgdown} site with the roxygen docs + vignettes + demos (code link)
  2. those docs are integrated into the readthedocs site using an absolute hyperlink (code link)
  3. these are linked from the "R API" nav item at https://lightgbm.readthedocs.io/en/latest/index.html, which directs to https://lightgbm.readthedocs.io/en/latest/R/reference/

It's a little awkward that the R docs are not technically part of the readthedocs site, but overall this has worked pretty well for us. Some notes:

  • {markdown} + {knitr} is a lighter-weight combination than using {rmarkdown} or {quarto} (fewer required dependencies to install)
  • there are many examples of using .Rmd files + this combination on CRAN, so staying CRAN-compliant requires very little ongoing maintenance
  • {pkgdown} is customizable enough to meet our needs, and it keeps up with changes to the RMarkdown format

Even if you don't pursue this specific mix, I do recommend not checking the .md version of the vignette into source control here. Instead, you could put any commands to generate it as part of the docs-site building into doc/conf.py or doc/Makefile. I shared a link above to how to run custom commands at build-time via conf.py. Here's an example of how to do it via the Makefile generate by sphinx: https://github.com/rapidsai/legate-boost/blob/ba062bf666845f82f5e8ab4690c97db4469ab42f/docs/Makefile#L17-L29

@david-cortes
Copy link
Contributor Author

Thanks for the suggestions, but while LightGBM's solution works in the sense of providing rendered artifacts, I don't think it's an ideal solution here - the doc page URL differs from the Sphinx one and is not as easily indexable/searchable as an embedded .md or .ipynb file in the same sphinx page.

I've added a script and instructions to update the .md file here, although I'm not yet 100% sure that the .qmd vignette with the settings it has would render at CRAN without errors. Will hand over to @trivialfis from here. If this is still too complicated, maybe the logic could be switched to having the vignette as .Rmd and the script modifying the header and extension to .qmd to render the .md instead.

Still waiting from comments from @mayer79 if there are any.

@trivialfis
Copy link
Member

Let me update the branch to prevent CI hang on mgpu tests. It's a AWS instance.

@jameslamb
Copy link
Contributor

I don't think it's an ideal solution here - the doc page URL differs from the Sphinx one and is not as easily indexable/searchable as an embedded .md or .ipynb file in the same sphinx page.

Sure, those are good points. You have to balance whether those benefits you listed are worth the added build complexity, heavier set of dependencies, and increased risk of CRAN rejections. Not my call to make here in xgboost, and I don't have anything else to add.

@david-cortes david-cortes changed the title [WIP] [R] Replace vignettes and examples [R] Replace vignettes and examples Jan 13, 2025
@trivialfis
Copy link
Member

trivialfis commented Jan 13, 2025

I will take a look at the doc build tomorrow.

@trivialfis
Copy link
Member

Hi, @david-cortes. May I ask what qdm provides that rmd doesn't in terms of generating md files?

@trivialfis
Copy link
Member

trivialfis commented Jan 14, 2025

For now, I think pkgdown with html output is the way to go. For a true integration with sphinx, we need a sphinx extension for the R domain, which is beyond the scope of XGBoost.

We can continue to use knitr for vignettes unless qmd provides extra capabilities that rmd doesn't have.

For information about sphinx domain, please see https://www.sphinx-doc.org/en/master/usage/domains/index.html .

@david-cortes
Copy link
Contributor Author

Hi, @david-cortes. May I ask what qdm provides that rmd doesn't in terms of generating md files?

As far as I'm aware, .Rmd files are not compilable to .md and/or .ipynb which is what sphinx can support as embedded files to render, whereas .qmd can produce both formats.

Issue is: it appears to use jupyter to generate formats like .md and .ipynb - but I'm not familiar very familiar with either quarto or knitr so perhaps there might be other ways around it that I'm not aware of.

@david-cortes
Copy link
Contributor Author

@trivialfis Example of how the result would look like:
image

Would be ideal to find a way to keep it looking like this.

@trivialfis
Copy link
Member

Thank you for sharing!

We have a script to create markdown files, see changes in a #11166, specially around the makefile.

Although I'm not sure how knitr works.

I think the amount of time we need to spend to make a satisfactory result by various conversions will be about the same as writing a proper sphinx extension. As a result, I suggest we use pkgdown for now.

If you feel the sphinx integration is important for the R community, planning a future sphinx extension is better use of time than hacking various markdown flavors.

@trivialfis
Copy link
Member

Please note that the knitr script doesn't create R api reference, while the pkgdown does.

@david-cortes
Copy link
Contributor Author

Actually on a second look: turns out that knitr does support building to .md (but not to .ipynb unfortunately, so we couldn't add plots or similar), without IR. So no need to use .qmd.

Now the next question: could this be arranged in such a way that the .md file is built programmatically by the CI and commited to the repository?

@trivialfis
Copy link
Member

If you believe the vegnittes should be rendered as sphinx doc, we can continue to use knitr and leave the API references to pkgdown

@david-cortes
Copy link
Contributor Author

If you believe the vegnittes should be rendered as sphinx doc, we can continue to use knitr and leave the API references to pkgdown

Yes, that'd be ideal. But how would that be achieved?

@trivialfis
Copy link
Member

trivialfis commented Jan 14, 2025

Good question, I am sure I can enforce new commits to update the MD files, but I don't know if I can build the MD files automatically. In the referenced pr, the doc builder hangs.

@david-cortes
Copy link
Contributor Author

Good question, I am sure I can enforce new commits to update the MD files, but I don't know if I can build the MD files automatically. In the referenced pr, the doc builder hangs.

That might not be a good idea - the .md would change for example whenever the random numbers produced differ, or when there are numerical differences in results between different CI machines, and so on.

@trivialfis
Copy link
Member

Opened an issue for RTD readthedocs/readthedocs.org#11928

I will try to reproduce using their container image tomorrow.

@david-cortes
Copy link
Contributor Author

@trivialfis In the CI, the Makefile under doc/R-package/ doesn't appear to be executed by the Makefile from doc.

I don't see any trace of it in the RTD log.

@trivialfis
Copy link
Member

trivialfis commented Jan 14, 2025

It's manual, I have never looked into it before. In my pr, I added some changes to the conf.py file to run it during Docs build.

Hopefully there's a better way to do it instead of burdening the RTD server

@david-cortes
Copy link
Contributor Author

It's manual, I have never looked into it before. In my pr, I added some changes to the conf.py file to run it during Docs build.

Hopefully there's a better way to do it instead of burdening the RTD server

Ok, then let's leave the sphinx rendering for a different PR. I've removed all the references to .md files from this one, leaving only the part about adding a new vignette and examples.

@trivialfis trivialfis merged commit 1f1cf3a into dmlc:master Jan 15, 2025
57 of 58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] Introductory vignette is outdated
3 participants